Abstract

Because ignoring the missing data in an evaluation may lead to results that are questionable, this 
study investigated the effects of use of four missing data handling techniques on a survey 
instrument. A questionnaire containing 35 five-point Likert style questions was completed by 384 
respondents. Of these, 166 (43%) questionnaires contained one or more missing responses. The 
missing data pattern was non-ignorable. Listwise deletion, pairwise deletion, regression, and the 
expectation maximization algorithm were used to treat the missing data. The resulting data were then 
submitted to factor analysis, and factor scores were obtained. Factor scores for each group defined by 
missing data method were then contrasted by multivariate analysis of variance. Less than 1% of 
the variance in scores could be explained by group (F = .218, df = 30, 3496, p = 1.0). What-if analyses 
were conducted by reducing the number of complete responses. 

When conducting survey research, the data analyzed often contain missing values. If the 
cause of the missing values is known, the solution to this problem may be incorporated in the 
study. Seemingly unavoidably, however, survey subjects fail to answer some questions for 
unknown reasons. Ignoring this problem may threaten the quality of the study and data analysis. 

In addition, several researchers have shown that different methods of handling missing 
values may produce different results. Witta and Kaiser (1991) reported that the regression 
coefficients and total variance accounted for by the variables changed depending on the method 
used to handle missing values. After re-analyzing three studies of private/public school 
achievement, Ward and Clark III (1991) concluded that the method used to handle missing data 
influenced the outcome of these studies. Mundfrom and Whitcomb (1998) found differences in 
correct classification of subjects based upon the missing data handling procedure used. 

Statement of the Problem 

The purpose of the current study was to examine the influence of missing data methods on 
factors created from a survey of high school student participants in an educational interactive 
video system. The missing data methods studied were listwise deletion, pairwise deletion, 
regression and expectation maximization. The sample consisted of 384 respondents. Of these, 166 
(43%) of the surveys contained one or more missing values. 

Until recently, the only methods available with popular statistical computer software 
focused on handling the missing data problem by deleting subjects with incomplete information, 
deleting the variables with missing values, or replacing the missing value with some reasonable 
estimate. Now, however, new subroutines are available to provide more assistance in handling 
missing data and providing analysis choices using iterative regression or expectation maximization 
(EM) procedures. These relatively new methods (in current software) also provide the possibility of 
specifying the model to be used (i.e., multivariate normality, adding a randomly selected error). 

Review of Literature 

Listwise Deletion 

Listwise deletion is probably the most frequently used method of handling missing data. 
This is usually the default option in statistical software programs. This method discards cases with 
a missing value on any variable and thus is very wasteful of data. Listwise deletion, however, has 
been shown to be effective with low average intercorrelation, fewer than four variables, and a small 
proportion of missing values (Chan et al., 1976; Haitovsky, 1968; Timm, 1970). The assumption 
of missing completely at random is crucial to the use of this method. It is more likely, however, to 
find the complete sample different in important ways from the incomplete sample (Little & Rubin, 
1987). Problems for a researcher using this method include a reduction in power, an increase in 
standard error, and the elimination of sub-populations. 
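
The data loss described above can be illustrated with a minimal sketch. The toy responses and the function name listwise_delete are hypothetical and are not taken from the survey; None marks a skipped item.

```python
# Hypothetical toy data: each row is one respondent, None marks a skipped item.
responses = [
    [4, 5, 3],
    [2, None, 4],
    [5, 4, 4],
    [None, 3, 2],
    [3, 3, None],
    [1, 2, 2],
]


def listwise_delete(rows):
    """Keep only the rows that have no missing value on any variable."""
    return [row for row in rows if all(value is not None for value in row)]


complete = listwise_delete(responses)
retained = len(complete) / len(responses)
print(f"{len(complete)} of {len(responses)} cases retained ({retained:.0%})")
```

Each discarded row here is missing only one item, yet half of the toy sample is lost. In the present survey the same rule would retain 218 of the 384 questionnaires.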

Pairwise Deletion 

When using pairwise deletion, covariances are computed between all pairs of variables 
having both observations, eliminating those that have a missing value for one of the two variables 
(Glasser, 1964). Means and variances are computed on all available observations. It is assumed 
that the use of the maximum number of pairs and all the individual observations yield more valid 
estimates of the relationship between the variables. This method is sometimes referred to as a 
“complete data” method. It is also assumed that when two variables are correlated, information on 
one improves the estimates of the other variable. An additional assumption is that the pairs are a 
random subset of the sample pairs. If these assumptions are true, pairwise deletion produces 
unbiased estimates of the variable means and variances (Hertel, 1976). When missing data are not 
missing completely at random, however, the correlation matrix produced by pairwise deletion may 
not be Gramian, that is, positive semidefinite (Norusis, 1988). When Marsh (1998) investigated the estimates produced by 
pairwise deletion for randomly missing data, he concluded parameter variability was explained, 
parameter estimates were unbiased, and only one covariance matrix was nonpositive definite. 
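
A minimal sketch of pairwise deletion follows. The data and function names are hypothetical. This variant centers each covariance on the means of the jointly observed pairs, which is an assumption; other variants center on the all-available means. Each diagonal entry is the variance over all available observations of that variable.

```python
def pairwise_cov(x, y):
    """Covariance of x and y using only cases observed on both variables."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)


def pairwise_cov_matrix(columns):
    """Covariance matrix in which every entry uses its own set of cases."""
    k = len(columns)
    return [[pairwise_cov(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]


# Hypothetical items; None marks a missing response.
q1 = [1, 2, 3, 4, None]
q2 = [2, 4, None, 8, 10]
q3 = [5, None, 3, 2, 1]

matrix = pairwise_cov_matrix([q1, q2, q3])
for row in matrix:
    print([round(value, 3) for value in row])
```

Because each entry is computed from a different subset of cases, nothing forces the assembled matrix to be positive semidefinite, which is the difficulty noted above.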

Regression 

The regression methods of handling missing data rely on information contained in 
non-missing values of other variables to provide estimates of missing values. As the correlation 
between variables and the number of variables increase, the regression methods, theoretically, 
perform better. Too many variables, however, can cause problems with overprediction (Kaiser & 
Tracy, 1988), and too high an average intercorrelation can result in a singular matrix. In these 
cases, regression does not perform well. 
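
The basic idea can be sketched with a single predictor. This is a simplified, deterministic illustration with hypothetical data and function names. It is not the SPSS routine used in this study, which can draw on many predictors and iterate.

```python
def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_impute(x, y):
    """Replace each missing y with its prediction from the complete cases."""
    complete = [(a, b) for a, b in zip(x, y) if b is not None]
    intercept, slope = fit_line([a for a, _ in complete],
                                [b for _, b in complete])
    return [b if b is not None else intercept + slope * a
            for a, b in zip(x, y)]


# Hypothetical items: x is fully observed, y has one missing response.
x = [1, 2, 3, 4, 5]
y = [2, 2, None, 4, 4]

imputed = regression_impute(x, y)
print(imputed)
```

Every imputed value lies exactly on the fitted line, so this form of imputation understates the variability of y. Adding a randomly selected residual to each prediction is one way to address that.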

Variations in the regression methods include differences in methods of developing the 
initial correlation matrix (listwise deletion, pairwise deletion, and mean substitution) and the 
presence or absence of iteration procedures. Differences in regression methods also include the 
use of randomly selected residuals for iterations and assumptions of a normal distribution. 
Mundfrom and Whitcomb (1998) investigated the effects of using mean substitution, hot-deck 
imputation, and regression imputation on classification of cardiac patients. Mean substitution and 
hot-deck imputation correctly classified patients more frequently than regression imputation. 

Expectation Maximization 

Dempster, Laird, and Rubin (1977) recommended the use of the EM algorithm, which 
imputes estimates simultaneously in an iterative procedure. The E step of this algorithm finds the 
conditional expectation of the missing values. The M step performs maximum likelihood 
estimation as if there were no missing data. The primary difference between this procedure and 
the regression procedure is that the values for the missing data are not imputed and then iterated. 
The missing values are functions based on the conditional expectation (Little & Rubin, 1987). 

This method of handling missing data represents a fundamental shift in the way of thinking about 
missing data (Schafer & Olsen, 1998). 
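
The two steps can be sketched for the simplest case: two variables assumed bivariate normal, with x fully observed and y partly missing. The data, function name, and fixed iteration count are illustrative assumptions. The SPSS routine used in this study handles many variables and is not reproduced here.

```python
def em_bivariate(x, y, iterations=200):
    """EM estimates for a bivariate normal when x is complete and y has gaps."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((a - mu_x) ** 2 for a in x) / n  # fixed, because x is complete
    observed = [b for b in y if b is not None]
    mu_y = sum(observed) / len(observed)        # starting values from observed y
    s_yy = sum((b - mu_y) ** 2 for b in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        # E step: expected y, y*y, and x*y for each case given current estimates.
        sum_y = sum_yy = sum_xy = 0.0
        for a, b in zip(x, y):
            if b is None:
                ey = mu_y + slope * (a - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = b, b ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += a * ey
        # M step: maximum likelihood estimates from the expected sums.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}


# Hypothetical data: y is missing for two of six cases.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, None, None, 8]

estimates = em_bivariate(x, y)
print({name: round(value, 4) for name, value in estimates.items()})
```

Note that the E step carries the residual variance into the expected value of y squared instead of filling in a single number and treating it as observed. This is the sense in which the missing values are handled as functions of the conditional expectation.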

Pattern of Missing Values 

All of the missing data handling procedures discussed require data missing at random 
(MAR) or missing completely at random (MCAR). Missing completely at random as described by 
Little and Rubin (1987) means that the probability of missingness is independent of that variable’s 
value and the value of any other variable in the data set. Missing at random as described by Little 
and Rubin (1987) means that the probability of missingness may depend on another variable, but 
not on the value of the variable itself. Yet Cohen and Cohen (1983) suggested that in survey 
research the absence of data on one variable may be related to another variable and may be due to 
the value of the variable itself. When investigating simultaneously missing values, Witta (1996/97) 
found concurrently missing values (p < .001) in three of four samples using data from a national 
database, and Witta (2000a) found concurrently missing values (p < .001) in all four samples using 
data from a national database. Little’s test of missing completely at random is now included with 
the missing data subroutine of SPSS. Unfortunately, because missing at random requires 
knowledge of the true value of the missing data, “there is no magic test for MAR” (Hill, 1997, p. 
43). Schafer and Olsen (1998), however, argue convincingly that “every missing-data method 
must make some largely untestable statistical assumptions about the manner in which the missing 
values were lost” (p. 551). Consequently, when analyzing real data, researchers typically assume 
missing at random. 
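
The two conditions can be stated compactly. The notation is an assumption introduced here for illustration and does not appear in the sources cited: Y = (Y_obs, Y_mis) denotes the complete data split into observed and missing parts, and R denotes the indicators of which values are missing.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}})
\end{align*}
```

When the probability of missingness still depends on Y_mis after conditioning on Y_obs, neither condition holds. Because Y_mis is never observed, the MAR condition cannot be checked directly from the data.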

Method 

Subjects 

All high school students enrolled in an interactive video class at this facility during the 
Spring semester, 1998, were surveyed. Questionnaires containing 35 five-point Likert style 
questions, some demographic questions, and some open-ended questions were administered during 
the regularly scheduled class time by the class instructor or remote facilitator. This analysis used 
only the 35 Likert type questions. 

Of the 384 returned surveys, respondents represented 19 classes and 24 high schools. 
Fifty-four percent of the sample were female and 59% were from the home site. One hundred 
sixty-six (43%) questionnaires contained one or more missing responses. Regression and the 
expectation maximization algorithm were used to treat the missing data using the missing data 
subroutine 7.5 in SPSS 10.0. The resultant data for each procedure were saved in a new file. In 
addition, the pattern of missing data and Little’s test of missing completely at random were 
recorded. The original data were replicated to create a data file for pairwise deletion analysis. 

Each of the four data sets was then submitted to principal components analysis in SPSS 
10.0. The factor scores produced by varimax rotation were saved with the data set. Finally, the 
four data sets were merged and factor scores for each group defined by missing data method were 
contrasted by multivariate analysis of variance. 

Results 

When data were examined in the missing data procedure in SPSS, the test of missing 
completely at random was statistically significant (χ² = 3292.352, df = 2910, p < .001). To 
examine the pattern of missing values, individual categories were created for any variable having 
4 or more cases with missing values. All other variables containing a missing value, but having fewer than 
4 (<1%) cases with a missing value, were categorized as “Other”. Over half of the incomplete 
cases were classified as “Other” or produced by missing values on an assortment of variables, as is 
depicted in Figure 1. Thus, visual examination of the pattern of missing values did not reveal any 
obvious patterns of missing values. 
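
One way such a tabulation might be carried out is sketched below. The rule for assigning a case to a category is an assumption, since the exact procedure is not spelled out above: a case gets a variable's own category only when that variable is its sole gap and the variable reaches the threshold, and every other incomplete case is counted as “Other”. The data and names are hypothetical, and the threshold is lowered to 2 to suit the toy sample.

```python
from collections import Counter


def missing_counts(rows, names):
    """Number of cases with a missing value (None) on each variable."""
    return {name: sum(1 for row in rows if row[i] is None)
            for i, name in enumerate(names)}


def categorize_cases(rows, names, threshold=4):
    """Tally incomplete cases by category (assumed rule, see lead-in)."""
    counts = missing_counts(rows, names)
    own_category = {name for name, count in counts.items() if count >= threshold}
    tally = Counter()
    for row in rows:
        gaps = [names[i] for i, value in enumerate(row) if value is None]
        if not gaps:
            continue  # complete case, not part of the pattern table
        if len(gaps) == 1 and gaps[0] in own_category:
            tally[gaps[0]] += 1
        else:
            tally["Other"] += 1
    return tally


# Hypothetical toy data; None marks a missing response.
names = ["q1", "q2", "q3"]
rows = [
    [4, 5, 3],
    [None, 2, 3],
    [None, 4, 5],
    [1, None, 3],
    [2, 3, None],
    [5, 5, 5],
]

tally = categorize_cases(rows, names, threshold=2)
print(dict(tally))
```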



Insert Figure 1 About Here 



After treatment by a missing data method, the resultant data sets from the EM algorithm 
and regression were analyzed using principal components analysis (PCA) with varimax 
rotation. The pairwise and listwise deletion data sets were analyzed in the same procedure using 
the pairwise and listwise missing setting in PCA. Using eigenvalues of 1 or larger, all data sets 
produced 10 factors. Although there were differences in the order in which factors were listed 
and, consequently, the variance explained by each, the factors produced were essentially the same. 
In addition, the factor loadings were very similar, as is depicted in Table 1. 



Insert Table 1 About Here 



The mean vectors of factor scores produced by each missing data handling treatment were 
then contrasted by multivariate analysis of variance in SPSS 10.0. Less than 1% of the variance in the mean 
vector of factor scores could be explained by group (F = .218, df = 30, 3496, p = 1.0). 

To provide an estimate of what would happen if 50% of the cases contained missing values, 166 
complete cases were randomly sampled from the 218 complete cases and merged with the 166 
incomplete cases, providing a data set of 332 cases. The previous procedure (treatment by missing 
data method, principal components analysis, varimax rotation, and creation of factor scores) was 
repeated. When tested for missing completely at random, Little’s MCAR test (χ² = 3303.005, df = 
2994, p < .001) again yielded statistically significant results. As anticipated, the cases were not 
missing completely at random. When the factor mean vectors were contrasted, less than 1% of the 
variance in scores was explained by missing data treatment group (F = .236, df = 30, 2945, p = 1.0). 

For the final what-if analysis, 100 complete cases were randomly sampled from the 218 
complete cases and merged with the 166 incomplete cases providing a data set of 266 cases. 
Missing data treatments were again applied and factor scores saved. Little’s MCAR test (χ² = 
3176.278, df = 2994, p = .010) was once more statistically significant. In this instance, however, 
only 9 factors emerged from the data. Again less than 1% of the variance in the mean vector of 
factor scores could be explained by missing data treatment group (F=.26, df=27, 2103, p=1.0). 
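
The construction of the two what-if data sets can be sketched as follows. The case identifiers, the function name, and the seed are placeholders. The actual sampling was carried out in SPSS, and the cases drawn there are not known.

```python
import random


def what_if_sample(complete_ids, incomplete_ids, n_complete, seed=0):
    """Draw n_complete complete cases at random and merge with all incomplete cases."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    kept = rng.sample(complete_ids, n_complete)
    return kept + incomplete_ids


complete_ids = list(range(218))          # placeholder IDs, 218 complete cases
incomplete_ids = list(range(218, 384))   # placeholder IDs, 166 incomplete cases

sizes = {}
for n_complete in (166, 100):
    merged = what_if_sample(complete_ids, incomplete_ids, n_complete)
    share = len(incomplete_ids) / len(merged)
    sizes[n_complete] = (len(merged), share)
    print(f"{n_complete} complete cases kept: {len(merged)} cases, "
          f"{share:.1%} incomplete")
```

The first draw yields 332 cases with half incomplete. The second yields 266 cases, of which roughly 62% are incomplete.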

Discussion and Conclusion 

In all cases used in this study there were no statistically significant differences in factor 
score mean vectors produced by the missing value treatments even though the pattern of missing 
values was not missing completely at random. In addition, the differences by missing value 
treatment group explained less than 1% of the variance in mean vectors. Based on these results, 
the choice of missing value treatment can be based upon consequences of loss of power by loss of 
cases or other data handling considerations. 

There was, however, a difference in the pattern of missing values in this study and those 
observed in other studies. In a study of missing values using a sample from the NELS 88 
(National Education Longitudinal Study of 1988), Witta (2000b) detected differences in missing 
data handling treatments. The pattern of missing values in that study was also not missing 
completely at random. The observed pattern, however, differed from the current study. In Witta’s 
(2000b) study, the category “Other” was less than 35% of the incomplete cases under all 
conditions. A few variables, specifically standardized test score, accounted for the remaining 65% 
of the cases. In the current study, the category “Other” represented more than 50% of the 
incomplete cases. This category consisted of variables with less than 1% of the values missing. It 
appears, therefore, that the effectiveness of the missing value methods is seriously affected by 
the pattern of missing values. Because there is no test for ‘Missing at Random’, an assumption is 
made that this condition is satisfied when testing effectiveness of missing data method. In the 
current study, this assumption appears to be satisfied. In Witta’s (2000b) study, that assumption 
appeared to be violated. As a result, it is strongly recommended that all studies first investigate the 
pattern of missing values. It is further suggested that future research investigate the patterns of 
missing values in an attempt to determine when it would be acceptable to assume ‘Missing at 
Random’. 